devtools::install_github("ZheyuanLi/SplinesUtils")
install.packages(c("dplyr", "ggplot2", "patchwork", "splines", "blmeco", "mgcv", "dlnm",
"plotly"))
library(dplyr)
library(ggplot2)
library(patchwork)
library(splines)
library(blmeco)
library(mgcv)
library(dlnm)
library(SplinesUtils)
library(plotly)
Triceps and subscapular skinfold thicknesses provide an index of body fat, and midarm muscle circumference provides a measure of muscle mass. The dataset contains the age in years, the log-transformed triceps skinfold thickness (lntriceps) and the triceps skinfold thickness in cm. In the first part of this practical we would like to examine the association between age and triceps skinfold thickness.
As a first step we load the data.
triceps <- read.csv(".../triceps.csv")
Question 1.1 Apply the function to the dataset to understand the data structure. Fit a regression model (Normal, Poisson or logistic, depending on the nature of the main outcome) to examine the linear association between triceps skinfold thickness and age. Provide an interpretation of the result.
Reply: For each additional year of age there is a 0.22 (95% CI: 0.20, 0.24) cm increase in the triceps skinfold thickness.
triceps %>% head()
triceps %>% ggplot() + geom_histogram(aes(x=triceps)) + theme_bw()
lm(triceps~age, data = triceps) -> mod_linear
mod_linear %>% summary()
##
## Call:
## lm(formula = triceps ~ age, data = triceps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.9512 -2.3965 -0.5154 1.5822 25.1233
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.19717 0.21244 29.17 <2e-16 ***
## age 0.21584 0.01014 21.28 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.007 on 890 degrees of freedom
## Multiple R-squared: 0.3372, Adjusted R-squared: 0.3365
## F-statistic: 452.8 on 1 and 890 DF, p-value: < 2.2e-16
mod_linear %>% confint()
## 2.5 % 97.5 %
## (Intercept) 5.7802307 6.614116
## age 0.1959348 0.235750
Question 1.2 What is the main assumption regarding the shape of the relationship between triceps skinfold thickness and age in question 1.1? Can you check if this assumption is valid?
Reply: The main assumption is that the relationship between triceps skinfold thickness and age is linear. We can plot them in a scatterplot to see if this assumption is valid.
Having plotted the data, the main drawback of the approach in question 1.1 is that it misses the initial declining trend over the first 10 years of life.
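A quick sketch of that check, assuming the `triceps` data frame loaded above (`geom_smooth` with `method = "lm"` overlays the least-squares line):

```r
## Scatterplot of triceps skinfold thickness against age with the
## linear fit overlaid; systematic departures from the red line
## indicate non-linearity
triceps %>%
  ggplot(aes(x = age, y = triceps)) +
  geom_point(cex = 0.7, alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, col = "red") +
  theme_bw()
```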
Question 1.3 Fit 3 models with polynomial basis functions of order 2, 3 and 4. Plot the results together and discuss. Perform a series of likelihood ratio tests to examine which model fits the data best.
mod_po2 <- lm(triceps~age+I(age^2), data = triceps)
mod_po3 <- lm(triceps~age+I(age^2)+I(age^3), data = triceps)
mod_po4 <- lm(triceps~age+I(age^2)+I(age^3)+I(age^4), data = triceps)
data.frame(x=rep(triceps$age, times = 3),
fitted=c(mod_po2$fitted.values, mod_po3$fitted.values,
mod_po4$fitted.values),
pol = rep(c("Order 2", "Order 3", "Order 4"),
each = mod_po2$fitted.values %>% length())) %>%
ggplot() + geom_point(data = triceps, aes(x = age, y = triceps), cex = 1.2) +
geom_line(aes(x = x, y = fitted, col = pol), cex = 1.2) +
theme_bw()
Interestingly, mod_po2 assigns a coefficient close to zero (and not statistically significant) to \(x^2\), so the fit is very similar to the linear approach.
mod_po2 %>% summary()
##
## Call:
## lm(formula = triceps ~ age + I(age^2), data = triceps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.5677 -2.4401 -0.4587 1.6368 24.9961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.0229191 0.3063806 19.658 < 2e-16 ***
## age 0.2434733 0.0364403 6.681 4.17e-11 ***
## I(age^2) -0.0006257 0.0007926 -0.789 0.43
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.008 on 889 degrees of freedom
## Multiple R-squared: 0.3377, Adjusted R-squared: 0.3362
## F-statistic: 226.6 on 2 and 889 DF, p-value: < 2.2e-16
and the likelihood ratio tests:
anova(mod_po2, mod_po3)
anova(mod_po3, mod_po4)
anova(mod_po2, mod_po4)
The likelihood ratio tests show the need for a more flexible model, favouring the degree-4 polynomial.
Question 1.4 Select the best performing model of question 1.3 and plot the flexible fit together with the 95% confidence intervals. Hint: Use the function to get the standard errors and compute the upper and lower limits of the confidence intervals.
Reply:
predict(mod_po4, newdata = data.frame(age = triceps$age), se.fit = TRUE) -> pred_p
data.frame(x = triceps$age,
mean = pred_p$fit,
UL = pred_p$fit + 1.96*pred_p$se.fit,
LL = pred_p$fit - 1.96*pred_p$se.fit
) %>%
ggplot() + geom_point(data = triceps, aes(x = age, y = triceps), cex = 1.2) +
geom_line(aes(x = x, y = mean), cex = 1.2) +
geom_ribbon(aes(x = x, ymin = LL, ymax = UL), fill = "blue", alpha = 0.2) +
theme_bw()
Question 1.5 Now use a Fourier basis function and fit three models with one sin/cos pair and period \(P=25,50,100\) years. Plot the results together and discuss.
Reply:
omega <- 2*pi/25
mod_fourier_p25 <- lm(triceps~I(sin(omega*age))+I(cos(omega*age)), data = triceps)
omega <- 2*pi/50
mod_fourier_p50 <- lm(triceps~I(sin(omega*age))+I(cos(omega*age)), data = triceps)
omega <- 2*pi/100
mod_fourier_p100 <- lm(triceps~I(sin(omega*age))+I(cos(omega*age)), data = triceps)
data.frame(x=rep(triceps$age, times = 3),
fitted=c(mod_fourier_p25$fitted.values, mod_fourier_p50$fitted.values, mod_fourier_p100$fitted.values),
pol = rep(c("P=25", "P=50", "P=100"), each = mod_fourier_p25$fitted.values %>% length())) %>%
ggplot() + geom_point(data = triceps, aes(x = age, y = triceps), cex = 1.2) +
geom_line(aes(x = x, y = fitted, col = pol), cex = 1.2) +
theme_bw()
Question 2.1 Use the same dataset as before and now fit 2 linear threshold models (not linear splines!), the first with one threshold at 10 years, and the second with two thresholds one at 10 and one at 40 years. Plot the results together and discuss. How do you interpret the results for the different thresholds? Discuss potential limitations of this approach.
Reply:
mod_lt_1 <- lm(triceps~age:(age<10), data = triceps)
mod_lt_1 %>% summary()
##
## Call:
## lm(formula = triceps ~ age:(age < 10), data = triceps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.5362 -2.4105 -0.3888 1.4132 24.8899
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.03377 0.28859 24.373 <2e-16 ***
## age:age < 10FALSE 0.19118 0.01161 16.464 <2e-16 ***
## age:age < 10TRUE -0.03042 0.05899 -0.516 0.606
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.969 on 889 degrees of freedom
## Multiple R-squared: 0.3503, Adjusted R-squared: 0.3489
## F-statistic: 239.7 on 2 and 889 DF, p-value: < 2.2e-16
predict(mod_lt_1, newdata = data.frame(age = triceps$age), se.fit = TRUE) -> pred_lt_1
data.frame(x = triceps$age,
mean = pred_lt_1$fit,
UL = pred_lt_1$fit + 1.96*pred_lt_1$se.fit,
LL = pred_lt_1$fit - 1.96*pred_lt_1$se.fit,
disc = ifelse(triceps$age<10, 1, 2)) %>%
ggplot() + geom_point(data = triceps, aes(x = age, y = triceps), cex = .7, alpha = 0.2) +
geom_line(aes(x = x, y = mean, group = disc), cex = 1.2) +
geom_ribbon(aes(x = x, ymin = LL, ymax = UL), fill = "blue", alpha = 0.2) +
theme_bw() -> p1
Follow similar procedure for the 2 thresholds:
mod_lt_2 <- lm(triceps~age:(age<10)+age:(age<40), data = triceps)
predict(mod_lt_2, newdata = data.frame(age = triceps$age), se.fit = TRUE) -> pred_lt_2
data.frame(x = triceps$age,
mean = pred_lt_2$fit,
UL = pred_lt_2$fit + 1.96*pred_lt_2$se.fit,
LL = pred_lt_2$fit - 1.96*pred_lt_2$se.fit,
disc = case_when(triceps$age<10~1,
triceps$age>=10 & triceps$age<40~2,
triceps$age>=40~3)) %>%
ggplot() + geom_point(data = triceps, aes(x = age, y = triceps), cex = .7, alpha = 0.2) +
geom_line(aes(x = x, y = mean, group = disc), cex = 1.2) +
geom_ribbon(aes(x = x, ymin = LL, ymax = UL), fill = "blue", alpha = 0.2) +
theme_bw() -> p2
plot together:
p1|p2
and the summary:
mod_lt_2 %>% summary()
##
## Call:
## lm(formula = triceps ~ age:(age < 10) + age:(age < 40), data = triceps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.5816 -2.4332 -0.4099 1.5267 24.6136
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.72667 0.31208 21.554 <2e-16 ***
## age:age < 10FALSE 0.17842 0.01262 14.138 <2e-16 ***
## age:age < 10TRUE -0.01866 0.05899 -0.316 0.7519
## age:age < 40TRUE 0.03661 0.01441 2.540 0.0113 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.957 on 888 degrees of freedom
## Multiple R-squared: 0.355, Adjusted R-squared: 0.3528
## F-statistic: 162.9 on 3 and 888 DF, p-value: < 2.2e-16
mod_lt_2 %>% confint()
## 2.5 % 97.5 %
## (Intercept) 6.114155452 7.33917493
## age:age < 10FALSE 0.153654495 0.20319134
## age:age < 10TRUE -0.134441326 0.09712603
## age:age < 40TRUE 0.008319934 0.06490164
For the model with the two thresholds (three segments): for people aged under 10, the slope is −0.019 + 0.037 = 0.018 cm per year, with weak evidence for the under-10 term; for people aged between 10 and 40, the slope is 0.178 + 0.037 = 0.215 cm per year; and for people aged 40 or older, only the age:(age < 10)FALSE term applies, giving 0.178 cm per year.
The discontinuity at the thresholds is a limitation of this approach, because it does not allow the fitted curve to follow the natural, continuous progression of the process.
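The size of the jump can be made explicit by predicting just either side of the threshold; a small sketch, assuming `mod_lt_1` fitted above:

```r
## Predicted triceps thickness just below and just above age 10;
## the difference between the two values is the discontinuity
predict(mod_lt_1, newdata = data.frame(age = c(9.99, 10.01)))
```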
Question 2.2 Now, without using the function or any splines-related packages in R, fit linear splines with one threshold at 10 years. Plot, calculate the 95% confidence intervals, interpret the results and compare with the results of question 2.1.
Reply:
mod_lsplines <- lm(triceps~age + I((age-10)*(age<10)), data = triceps)
predict(mod_lsplines, newdata = data.frame(age = triceps$age), se.fit = TRUE) -> pred_lsplines
data.frame(x = triceps$age,
mean = pred_lsplines$fit,
UL = pred_lsplines$fit + 1.96*pred_lsplines$se.fit,
LL = pred_lsplines$fit - 1.96*pred_lsplines$se.fit) %>%
ggplot() + geom_point(data = triceps, aes(x = age, y = triceps), cex = .7, alpha = 0.2) +
geom_line(aes(x = x, y = mean), cex = 1.2) +
geom_ribbon(aes(x = x, ymin = LL, ymax = UL), fill = "blue", alpha = 0.2) +
theme_bw()
mod_lsplines %>% summary()
##
## Call:
## lm(formula = triceps ~ age + I((age - 10) * (age < 10)), data = triceps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.9714 -1.9977 -0.6263 1.2370 25.3828
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.74708 0.34338 13.825 < 2e-16 ***
## age 0.26452 0.01354 19.536 < 2e-16 ***
## I((age - 10) * (age < 10)) -0.30572 0.05740 -5.326 1.27e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.946 on 889 degrees of freedom
## Multiple R-squared: 0.3577, Adjusted R-squared: 0.3563
## F-statistic: 247.6 on 2 and 889 DF, p-value: < 2.2e-16
mod_lsplines %>% confint()
## 2.5 % 97.5 %
## (Intercept) 4.0731628 5.4210051
## age 0.2379444 0.2910937
## I((age - 10) * (age < 10)) -0.4183720 -0.1930663
For people aged 10 years or older, for every year increase in age there is a 0.26 cm increase in the triceps skinfold thickness. For people aged under 10, the slope per year of age is the sum of the two coefficients, 0.264 − 0.306 = −0.042 cm.
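The two segment slopes can be recovered directly from the coefficient vector; a sketch, assuming `mod_lsplines` from above:

```r
b <- coef(mod_lsplines)
## Slope for age >= 10: the `age` coefficient (about 0.264)
b["age"]
## Slope for age < 10: `age` plus the spline-term coefficient
## (about 0.264 - 0.306 = -0.042)
b["age"] + b["I((age - 10) * (age < 10))"]
```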
Question 2.3 Using the function, fit 3 linear B-spline models to the data, with one knot each time at 5, 10 and 15, and plot all three fits. Which model is the most appropriate for our data?
Reply:
kn <- c(5, 10, 15)
list_res <- pred_res <- ci_graph <- list()
for(i in 1:3){
list_res[[i]] <- lm(triceps~bs(age, degree=1, knots=kn[i]), data = triceps)
pred_res[[i]] <- predict(list_res[[i]], newdata = data.frame(age = triceps$age), se.fit = TRUE)
ci_graph[[i]] <- data.frame(
est = pred_res[[i]]$fit,
UL = pred_res[[i]]$fit + 1.96*pred_res[[i]]$se.fit,
LL = pred_res[[i]]$fit - 1.96*pred_res[[i]]$se.fit,
thr = kn[i]
)
}
ci <- do.call(rbind, ci_graph)
ci$age <- rep(triceps$age, times = 3)
ggplot(data = ci) + geom_point(data = triceps, aes(x = age, y = triceps), cex = .7, alpha = 0.2) +
geom_line(aes(x = age, y = est), cex = 1.2) +
geom_ribbon(aes(x = age, ymin = LL, ymax = UL), fill = "blue", alpha = 0.2) +
facet_grid(cols = vars(thr)) + theme_bw()
The main problem with these models is that different knot locations yield different fits. This sensitivity to knot placement makes them unattractive when we have no prior information about where the threshold lies. We can use criteria such as the WAIC to get an indication about better fits.
lapply(list_res, function(X){
WAIC(X)$WAIC1 %>% return()
})
## [[1]]
## [1] 4983.829
##
## [[2]]
## [1] 4988.321
##
## [[3]]
## [1] 5015.264
with the smallest being for a knot at 5 years.
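As an informal cross-check (not asked for by the question), base-R AIC should rank the three knot choices in the same order:

```r
## AIC for the models with a knot at 5, 10 and 15 years respectively
sapply(list_res, AIC)
```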
Question 2.4 Take the of the model from question 2.3 with a knot at 10 years and compare the results with the fit of question 2.2. What do you observe?
Reply:
mod_lsplines %>% summary()
##
## Call:
## lm(formula = triceps ~ age + I((age - 10) * (age < 10)), data = triceps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.9714 -1.9977 -0.6263 1.2370 25.3828
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.74708 0.34338 13.825 < 2e-16 ***
## age 0.26452 0.01354 19.536 < 2e-16 ***
## I((age - 10) * (age < 10)) -0.30572 0.05740 -5.326 1.27e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.946 on 889 degrees of freedom
## Multiple R-squared: 0.3577, Adjusted R-squared: 0.3563
## F-statistic: 247.6 on 2 and 889 DF, p-value: < 2.2e-16
list_res[[2]] %>% summary()
##
## Call:
## lm(formula = triceps ~ bs(age, degree = 1, knots = kn[i]), data = triceps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.9714 -1.9977 -0.6263 1.2370 25.3828
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.7936 0.3558 21.906 <2e-16 ***
## bs(age, degree = 1, knots = kn[i])1 -0.4013 0.4800 -0.836 0.403
## bs(age, degree = 1, knots = kn[i])2 10.6424 0.5220 20.388 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.946 on 889 degrees of freedom
## Multiple R-squared: 0.3577, Adjusted R-squared: 0.3563
## F-statistic: 247.6 on 2 and 889 DF, p-value: < 2.2e-16
The coefficients are not the same, although the residuals, R-squared and residual standard error are identical: the two models are different parametrisations of the same fitted curve.
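That the two fits coincide despite different coefficients can be checked directly; a sketch, assuming both models from above:

```r
## Identical fitted values despite different parametrisations
all.equal(unname(fitted(mod_lsplines)), unname(fitted(list_res[[2]])))
```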
Question 2.5 In this question we will plot the basis function to understand the meaning of the coefficients of the fit.
b <- bs(triceps$age, degree = 1, knots = 10)
str(b)
## 'bs' num [1:892, 1:2] 0.951 0.991 0.999 0.964 0.997 ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:2] "1" "2"
## - attr(*, "degree")= int 1
## - attr(*, "knots")= num 10
## - attr(*, "Boundary.knots")= num [1:2] 0.26 51.75
## - attr(*, "intercept")= logi FALSE
b1 <- b[, 1] ## basis 1
b2 <- b[, 2] ## basis 2
data.frame(x=rep(triceps$age, times=2),
b=c(b1, b2),
basis=rep(c("basis 1", "basis 2"), each = triceps$age %>% length())) %>%
ggplot() + geom_line(aes(x=x, y=b), cex=.8) + facet_grid(cols=vars(basis)) + theme_bw()
and now run the following:
lm(triceps$triceps~b1+b2) %>% summary()
##
## Call:
## lm(formula = triceps$triceps ~ b1 + b2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.9714 -1.9977 -0.6263 1.2370 25.3828
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.7936 0.3558 21.906 <2e-16 ***
## b1 -0.4013 0.4800 -0.836 0.403
## b2 10.6424 0.5220 20.388 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.946 on 889 degrees of freedom
## Multiple R-squared: 0.3577, Adjusted R-squared: 0.3563
## F-statistic: 247.6 on 2 and 889 DF, p-value: < 2.2e-16
list_res[[2]] %>% summary()
##
## Call:
## lm(formula = triceps ~ bs(age, degree = 1, knots = kn[i]), data = triceps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.9714 -1.9977 -0.6263 1.2370 25.3828
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.7936 0.3558 21.906 <2e-16 ***
## bs(age, degree = 1, knots = kn[i])1 -0.4013 0.4800 -0.836 0.403
## bs(age, degree = 1, knots = kn[i])2 10.6424 0.5220 20.388 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.946 on 889 degrees of freedom
## Multiple R-squared: 0.3577, Adjusted R-squared: 0.3563
## F-statistic: 247.6 on 2 and 889 DF, p-value: < 2.2e-16
What do the coefficients mean?
Reply: The values of the coefficients in the bs version of the call are essentially scaling factors that calibrate the basis functions. What we are really interested in is their combination, which gives the overall fit.
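To see this, the fitted curve can be rebuilt by hand as the coefficient-weighted sum of the basis columns; a sketch, assuming `b1`, `b2` and `list_res[[2]]` from above:

```r
beta <- coef(list_res[[2]])
## Intercept plus scaled basis functions reproduces the fit exactly
fit_manual <- beta[1] + beta[2] * b1 + beta[3] * b2
all.equal(unname(fit_manual), unname(fitted(list_res[[2]])))
```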
to get the estimates of the two separate lines:
RegSplineAsPiecePoly(list_res[[2]], "bs(age, degree = 1, knots = kn[i])", shift = FALSE)
## 2 piecewise polynomials of degree 1 are constructed!
## Use 'summary' to export all of them.
## The first 2 are printed below.
## 0.0107 - 0.0412 * x
## -3.05 + 0.265 * x
How can we interpret the results here? Compare the estimates with the ones retrieved in question 2.2.
Reply: The interpretation is now the natural one, without needing any reparametrisation: for every year increase in age, for people aged under 10, there is a 0.0412 cm decrease in the triceps skinfold thickness; for people older than 10, there is a 0.265 cm increase. These slopes match the ones retrieved in question 2.2 (−0.042 and 0.265).
Question 3.1 Use the same dataset as before and, similar to question 2.5, use the function with one knot at 10 and cubic splines.
Reply:
b <- bs(triceps$age, degree = 3, knots = 10)
str(b)
## 'bs' num [1:892, 1:4] 0.565 0.662 0.656 0.59 0.652 ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:4] "1" "2" "3" "4"
## - attr(*, "degree")= int 3
## - attr(*, "knots")= num 10
## - attr(*, "Boundary.knots")= num [1:2] 0.26 51.75
## - attr(*, "intercept")= logi FALSE
b1 <- b[, 1] ## basis 1
b2 <- b[, 2] ## basis 2
b3 <- b[, 3] ## basis 3
b4 <- b[, 4] ## basis 4
data.frame(x=rep(triceps$age, times=4),
b=c(b1, b2, b3, b4),
basis=rep(c("basis 1", "basis 2", "basis 3", "basis 4"),
each = triceps$age %>% length())) %>%
ggplot() + geom_line(aes(x=x, y=b), cex=.8) + facet_grid(cols=vars(basis)) + theme_bw()
Question 3.2 Use the function and to fit cubic splines with one knot at 10. What is the interpretation of the output?
Reply:
mod_bs_3_splines <- lm(triceps~bs(age, knots = 10, degree = 3), data = triceps)
mod_bs_3_splines %>% summary()
##
## Call:
## lm(formula = triceps ~ bs(age, knots = 10, degree = 3), data = triceps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.1907 -2.0001 -0.4568 1.4082 24.1620
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.5996 0.5868 16.360 < 2e-16 ***
## bs(age, knots = 10, degree = 3)1 -4.2332 0.9285 -4.559 5.85e-06 ***
## bs(age, knots = 10, degree = 3)2 0.5109 1.0109 0.505 0.61343
## bs(age, knots = 10, degree = 3)3 9.9067 1.4876 6.659 4.81e-11 ***
## bs(age, knots = 10, degree = 3)4 3.0499 1.0791 2.826 0.00482 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.842 on 887 degrees of freedom
## Multiple R-squared: 0.3927, Adjusted R-squared: 0.3899
## F-statistic: 143.4 on 4 and 887 DF, p-value: < 2.2e-16
The coefficients scale the evaluated B-spline basis functions according to the data in order to fit the curve.
Question 3.3 Use the function and plot the fit of the above model together with 95% confidence intervals.
Reply:
x2pred <- seq(from = 1, to = 50, length.out = 200)
predict(mod_bs_3_splines, newdata = data.frame(age=x2pred), se.fit = TRUE) -> pred_bs_3_splines
data.frame(
x=x2pred,
est=pred_bs_3_splines$fit,
UL=pred_bs_3_splines$fit+1.96*pred_bs_3_splines$se.fit,
LL=pred_bs_3_splines$fit-1.96*pred_bs_3_splines$se.fit
) %>%
ggplot() + geom_point(data = triceps, aes(x = age, y = triceps), cex = .7, alpha = 0.2) +
geom_line(aes(x = x, y = est), cex = 1.2) +
geom_ribbon(aes(x = x, ymin = LL, ymax = UL), fill = "blue", alpha = 0.2) + theme_bw() -> p1
Question 3.4 Use natural splines instead and plot the fit of the above model together with 95% confidence intervals. Do you observe any differences? Tip: Include boundary knots in the function.
mod_ns_3_splines <- lm(triceps~ns(age, knots = c(1, 10, 50)), data = triceps)
mod_ns_3_splines %>% summary()
##
## Call:
## lm(formula = triceps ~ ns(age, knots = c(1, 10, 50)), data = triceps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.1922 -1.9987 -0.4574 1.4069 24.1611
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.5707 0.5817 16.454 < 2e-16 ***
## ns(age, knots = c(1, 10, 50))1 0.5012 0.9508 0.527 0.598
## ns(age, knots = c(1, 10, 50))2 9.7536 1.3983 6.975 5.96e-12 ***
## ns(age, knots = c(1, 10, 50))3 0.3639 1.3869 0.262 0.793
## ns(age, knots = c(1, 10, 50))4 5.9575 1.0718 5.558 3.60e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.842 on 887 degrees of freedom
## Multiple R-squared: 0.3927, Adjusted R-squared: 0.39
## F-statistic: 143.4 on 4 and 887 DF, p-value: < 2.2e-16
predict(mod_ns_3_splines, newdata = data.frame(age=x2pred), se.fit = TRUE) -> pred_ns_3_splines
data.frame(
x=x2pred,
est=pred_ns_3_splines$fit,
UL=pred_ns_3_splines$fit+1.96*pred_ns_3_splines$se.fit,
LL=pred_ns_3_splines$fit-1.96*pred_ns_3_splines$se.fit
) %>%
ggplot() + geom_point(data = triceps, aes(x = age, y = triceps), cex = .7, alpha = 0.2) +
geom_line(aes(x = x, y = est), cex = 1.2) +
geom_ribbon(aes(x = x, ymin = LL, ymax = UL), fill = "blue", alpha = 0.2) + theme_bw() -> p2
p1|p2
The behaviour of at the boundaries is in general less wiggly, because linearity is assumed beyond the boundary knots. Here it seems that the results are very similar.
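One way to see the boundary constraint (an extra check, not part of the question) is that the natural-spline fit extrapolates linearly beyond the last boundary knot; a sketch, assuming `mod_ns_3_splines` from above:

```r
## Predictions at equally spaced ages beyond 50 increase by equal
## increments, i.e. they lie on a straight line
predict(mod_ns_3_splines, newdata = data.frame(age = c(52, 54, 56)))
```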
Question 3.5 Type and check what the argument does. Now use only the argument in the function and use the values 1, 10, 50, 100. What do you observe?
?ns
lapply(c(1, 10, 50, 100), function(X){
ggplot() + geom_point(data = triceps, aes(x = age, y = triceps), cex = .7, alpha = 0.2) +
geom_line(aes(x = triceps$age, y = lm(triceps~ns(age, df=X),
data = triceps)$fitted.values), cex = 1.2) +
theme_bw() + ggtitle(paste("df=",X, sep = " ")) %>% return()
}) -> p_ns
(p_ns[[1]]|p_ns[[2]])/(p_ns[[3]]|p_ns[[4]])
As we increase the degrees of freedom, the bias decreases but the variance increases; as we decrease them, the bias increases and the variance decreases.
Question 3.6 Use the function from the package and fit a penalized spline. What do you think about the fit?
Reply:
mgcv::gam(triceps~s(age), family = "gaussian", data = triceps) -> mod_psplines
plot(mod_psplines)
or
predict(mod_psplines, data.frame(age = triceps$age), se.fit = TRUE) -> pred_splines
UL <- pred_splines$fit + 1.96*pred_splines$se.fit
LL <- pred_splines$fit - 1.96*pred_splines$se.fit
ggplot( ) +
geom_point(data = triceps, aes(x=age, y=triceps), cex = 1) +
geom_line(data = triceps, aes(x=age, y=pred_splines$fit) , linewidth = .5) +
geom_ribbon(aes(ymin=LL, ymax=UL, x=triceps$age), fill="red", alpha=0.2) + theme_bw()
It looks as though the penalised spline provides a very good fit to the data, capturing the main features/trends while at the same time not being overly variable.
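To see how much flexibility the penalty actually allowed, one can inspect the estimated smoothing parameter and the effective degrees of freedom; a sketch, assuming `mod_psplines` from above:

```r
## Estimated smoothing parameter (larger = smoother fit)
mod_psplines$sp
## Effective degrees of freedom of the s(age) term
summary(mod_psplines)$edf
```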
In the fourth part of this practical you will analyse the HANES Ib dataset. This dataset includes information about the age, sex (men=0, women=1), race, location (locode), height, BMI (body mass index), booze (categorical alcohol consumption), serum calcium (ser.calc), serum cholesterol (ser.chol), current smoking, smoking history, number of cigarettes per day, lifetime pack-years, and follow-up variables including age at death (d.age), year of death (d.year), death from any cause (d.total), death from cancer (d.cancer) and death from heart disease (d.heart).
As a first step we load the data and assign to a data.frame.
hanes <- read.csv(".../hanes.csv")
Question 4.1 Fit a logistic regression model for mortality (d.total) with main effects for age, sex, race, booze, smokever, and bp1sys. Interpret the output of the model (focus on the coefficient of the systolic blood pressure).
Reply:
glm(d.total~age+factor(sex)+factor(race)+factor(booze)+factor(smokever)+bp1sys,
family = "binomial", data = hanes) -> mod1
mod1 %>% summary()
##
## Call:
## glm(formula = d.total ~ age + factor(sex) + factor(race) + factor(booze) +
## factor(smokever) + bp1sys, family = "binomial", data = hanes)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1805 -0.5721 -0.2715 -0.1230 3.1536
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.598828 0.387174 -22.209 < 2e-16 ***
## age 0.109138 0.004904 22.256 < 2e-16 ***
## factor(sex)1 -0.614373 0.101563 -6.049 1.46e-09 ***
## factor(race)2 0.571150 0.126908 4.500 6.78e-06 ***
## factor(race)3 -1.037522 0.583993 -1.777 0.0756 .
## factor(booze)1 -0.181950 0.141989 -1.281 0.2000
## factor(booze)2 -0.305964 0.143113 -2.138 0.0325 *
## factor(booze)3 0.074611 0.128658 0.580 0.5620
## factor(smokever)1 0.538363 0.106398 5.060 4.19e-07 ***
## bp1sys 0.008501 0.001981 4.292 1.77e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4056.3 on 4174 degrees of freedom
## Residual deviance: 2916.0 on 4165 degrees of freedom
## (22 observations deleted due to missingness)
## AIC: 2936
##
## Number of Fisher Scoring iterations: 6
coefficients(mod1)["bp1sys"] %>% exp()
## bp1sys
## 1.008537
(mod1 %>% confint() %>% exp())*100
## 2.5 % 97.5 %
## (Intercept) 8.529502e-03 0.03893163
## age 1.104820e+02 112.62695379
## factor(sex)1 4.429841e+01 65.97041861
## factor(race)2 1.378759e+02 226.80855958
## factor(race)3 9.755388e+00 101.42938223
## factor(booze)1 6.305594e+01 110.04486796
## factor(booze)2 5.557751e+01 97.42188391
## factor(booze)3 8.376502e+01 138.72863865
## factor(smokever)1 1.392226e+02 211.30249975
## bp1sys 1.004631e+02 101.24660830
After accounting for age, sex, ethnicity, alcohol consumption and smoking, we observe a 0.85% (0.46%, 1.25%) increase in the odds of dying for every unit increase in systolic blood pressure.
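The percentage change quoted above can be computed directly from the coefficient, rather than reading it off the scaled confidence-interval table:

```r
## Percent change in the odds of dying per 1-unit increase in
## systolic blood pressure, with 95% CI
(exp(coef(mod1)["bp1sys"]) - 1) * 100
(exp(confint(mod1)["bp1sys", ]) - 1) * 100
```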
Question 4.2 Now let’s examine the interaction between systolic blood pressure and sex. Is it significant? Interpret the results.
Reply:
glm(d.total~age+factor(sex)*bp1sys + factor(race)+factor(booze)+factor(smokever),
family = "binomial", data = hanes) -> mod2
mod2 %>% summary()
##
## Call:
## glm(formula = d.total ~ age + factor(sex) * bp1sys + factor(race) +
## factor(booze) + factor(smokever), family = "binomial", data = hanes)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1220 -0.5721 -0.2709 -0.1206 3.1757
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.348277 0.467203 -17.869 < 2e-16 ***
## age 0.109105 0.004905 22.246 < 2e-16 ***
## factor(sex)1 -1.118755 0.544232 -2.056 0.0398 *
## bp1sys 0.006650 0.002779 2.392 0.0167 *
## factor(race)2 0.569230 0.126921 4.485 7.29e-06 ***
## factor(race)3 -1.045539 0.584065 -1.790 0.0734 .
## factor(booze)1 -0.176383 0.142228 -1.240 0.2149
## factor(booze)2 -0.300030 0.143264 -2.094 0.0362 *
## factor(booze)3 0.082356 0.128897 0.639 0.5229
## factor(smokever)1 0.545006 0.106682 5.109 3.24e-07 ***
## factor(sex)1:bp1sys 0.003570 0.003782 0.944 0.3452
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4056.3 on 4174 degrees of freedom
## Residual deviance: 2915.1 on 4164 degrees of freedom
## (22 observations deleted due to missingness)
## AIC: 2937.1
##
## Number of Fisher Scoring iterations: 6
There is weak evidence of an interaction, implying that, after accounting for age, ethnicity, alcohol consumption and smoking, sex does not modify the mortality risk associated with systolic blood pressure.
anova(mod1, mod2, test = "Chisq")
The likelihood ratio test also argues that there is weak evidence supporting an interaction term.
Question 4.3 Repeat the analysis of question 4.1 using the function and a smooth function for age. Is the effect of age nonlinear? Reply:
mgcv::gam(d.total~s(age)+factor(sex) + bp1sys + factor(race)+factor(booze)+factor(smokever),
family = "binomial", data = hanes) -> mod3
mod3 %>% summary()
##
## Family: binomial
## Link function: logit
##
## Formula:
## d.total ~ s(age) + factor(sex) + bp1sys + factor(race) + factor(booze) +
## factor(smokever)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.291451 0.309045 -10.650 < 2e-16 ***
## factor(sex)1 -0.620374 0.102198 -6.070 1.28e-09 ***
## bp1sys 0.008592 0.001997 4.303 1.69e-05 ***
## factor(race)2 0.578405 0.127308 4.543 5.54e-06 ***
## factor(race)3 -1.062823 0.588344 -1.806 0.0708 .
## factor(booze)1 -0.178562 0.143185 -1.247 0.2124
## factor(booze)2 -0.303815 0.144224 -2.107 0.0352 *
## factor(booze)3 0.086606 0.129841 0.667 0.5048
## factor(smokever)1 0.559126 0.107609 5.196 2.04e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(age) 2.243 2.818 532.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.284 Deviance explained = 28.3%
## UBRE = -0.29805 Scale est. = 1 n = 4175
plot(mod3)
or
predict(mod3, hanes %>%
dplyr::select(age, sex, bp1sys, race, booze, smokever) %>%
mutate(sex=0, bp1sys=mean(bp1sys, na.rm = TRUE), race=1, booze=0, smokever=0),
se.fit = TRUE) -> pred_mod3
UL <- pred_mod3$fit + 1.96*pred_mod3$se.fit
LL <- pred_mod3$fit - 1.96*pred_mod3$se.fit
ggplot() +
geom_line(data=hanes, aes(x=age, y=pred_mod3$fit)) +
geom_ribbon(data=hanes, aes(x=age, ymin=LL, ymax=UL), fill="blue", alpha=0.1) +
theme_bw() + ylab("log odds")
The estimated smooth uses about 2.2 effective degrees of freedom, but the plotted effect looks reasonably close to linear.
Question 4.4 Add a smooth function for . Is it linear?
Reply:
mgcv::gam(d.total~age+factor(sex) + bp1sys + factor(race)+factor(booze)+factor(smokever)+s(ser.chol),
family = "binomial", data = hanes) -> mod4
mod4 %>% summary()
##
## Family: binomial
## Link function: logit
##
## Formula:
## d.total ~ age + factor(sex) + bp1sys + factor(race) + factor(booze) +
## factor(smokever) + s(ser.chol)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.632185 0.390614 -22.099 < 2e-16 ***
## age 0.109314 0.004955 22.062 < 2e-16 ***
## factor(sex)1 -0.630728 0.102925 -6.128 8.90e-10 ***
## bp1sys 0.008688 0.001987 4.372 1.23e-05 ***
## factor(race)2 0.557623 0.127336 4.379 1.19e-05 ***
## factor(race)3 -1.057189 0.588610 -1.796 0.0725 .
## factor(booze)1 -0.174837 0.142423 -1.228 0.2196
## factor(booze)2 -0.296272 0.143376 -2.066 0.0388 *
## factor(booze)3 0.074459 0.129010 0.577 0.5638
## factor(smokever)1 0.546635 0.106702 5.123 3.01e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(ser.chol) 2.975 3.8 10.48 0.0282 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.285 Deviance explained = 28.4%
## UBRE = -0.29845 Scale est. = 1 n = 4175
plot(mod4)
It looks as though serum cholesterol needs a more flexible fit, but overall the effect seems to be close to 0.
Question 4.5 Now model d.heart instead of d.total. Is the dependence on linear?
Reply:
mgcv::gam(d.heart ~age+factor(sex) + bp1sys + factor(race)+factor(booze)+factor(smokever)+s(ser.chol),
family = "binomial", data = hanes) -> mod5
mod5 %>% summary()
##
## Family: binomial
## Link function: logit
##
## Formula:
## d.heart ~ age + factor(sex) + bp1sys + factor(race) + factor(booze) +
## factor(smokever) + s(ser.chol)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.101720 0.584437 -15.573 < 2e-16 ***
## age 0.089980 0.007508 11.984 < 2e-16 ***
## factor(sex)1 -0.750685 0.158105 -4.748 2.05e-06 ***
## bp1sys 0.011594 0.002755 4.208 2.58e-05 ***
## factor(race)2 -0.299870 0.207268 -1.447 0.14796
## factor(race)3 -1.087562 1.036394 -1.049 0.29401
## factor(booze)1 -0.412539 0.207962 -1.984 0.04729 *
## factor(booze)2 -0.541198 0.209479 -2.584 0.00978 **
## factor(booze)3 -0.329031 0.179388 -1.834 0.06663 .
## factor(smokever)1 0.524972 0.160917 3.262 0.00110 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(ser.chol) 1.499 1.868 17 0.000785 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.106 Deviance explained = 21%
## UBRE = -0.62485 Scale est. = 1 n = 4175
plot(mod5)
The effect here seems to be linear. For parsimony we will use a linear term instead.
mgcv::gam(d.heart ~age+factor(sex) + bp1sys + factor(race)+factor(booze)+factor(smokever)+ser.chol,
family = "binomial", data = hanes) %>%
summary()
##
## Family: binomial
## Link function: logit
##
## Formula:
## d.heart ~ age + factor(sex) + bp1sys + factor(race) + factor(booze) +
## factor(smokever) + ser.chol
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.329018 0.672479 -15.360 < 2e-16 ***
## age 0.089893 0.007513 11.965 < 2e-16 ***
## factor(sex)1 -0.751705 0.157980 -4.758 1.95e-06 ***
## bp1sys 0.011539 0.002755 4.188 2.81e-05 ***
## factor(race)2 -0.294546 0.207100 -1.422 0.15496
## factor(race)3 -1.081188 1.036548 -1.043 0.29692
## factor(booze)1 -0.414018 0.207916 -1.991 0.04645 *
## factor(booze)2 -0.543646 0.209408 -2.596 0.00943 **
## factor(booze)3 -0.329922 0.179319 -1.840 0.06579 .
## factor(smokever)1 0.522181 0.160840 3.247 0.00117 **
## ser.chol 0.005563 0.001480 3.759 0.00017 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## R-sq.(adj) = 0.106 Deviance explained = 20.9%
## UBRE = -0.62477 Scale est. = 1 n = 4175
In the fifth part of this tutorial, we will use the chicagoNMMAPS dataset of the dlnm package used throughout the lecture and examine the effect of heatwaves.
Question 5.1 What do you think a heatwave is? How should we define it?
There is no universal definition of a heatwave. The three major components for defining one are a temperature metric (minimum, mean or maximum), an intensity (a threshold which, once exceeded, implies extreme heat) and a duration (the number of consecutive days for which the temperature exceeds the above-mentioned threshold).
Question 5.2 Let a heatwave be a period during which we observe temperatures higher than the 95th percentile of the overall temperature in Chicago for at least 2 consecutive days. Create a new variable called heatwave that takes the value 1 if a day falls within this definition and 0 otherwise.
library(dlnm)
chicagoNMMAPS %>% head()
You can use the following code to define the heatwave variable:
threshold <- quantile(chicagoNMMAPS$temp, probs = 0.95)
# Step 1: flag days on which the temperature, and that of the two previous
# days, is at or above the threshold (i.e. day 3 onwards of each hot run)
chicagoNMMAPS %>%
dplyr::mutate(
heatwave = dplyr::case_when(temp >= threshold &
dplyr::lag(temp, 1) >= threshold &
dplyr::lag(temp, 2) >= threshold ~ 1,
TRUE ~ 0)
) -> chicagoNMMAPS
# Step 2: back-fill the flag so that the first two days of each heatwave
# run are also marked as heatwave days
chicagoNMMAPS %>%
dplyr::mutate(
heatwave = dplyr::case_when(heatwave == 1 |
dplyr::lead(heatwave, 1) == 1 |
dplyr::lead(heatwave, 2) == 1 ~ 1,
TRUE ~ 0)
) -> chicagoNMMAPS
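The same flag can be built in one pass with base R's run-length encoding, which makes the definition explicit: a day is a heatwave day when it belongs to a run of at least three consecutive days at or above the threshold. A toy series (not the Chicago data, with a hypothetical threshold) for intuition:

```r
# Toy illustration of the heatwave flag via run-length encoding:
# keep only runs of >= 3 consecutive hot days
toy_temp <- c(20, 30, 31, 32, 21, 30, 31, 20)
hot <- toy_temp >= 29                     # hypothetical threshold of 29
runs <- rle(hot)
runs$values <- runs$values & runs$lengths >= 3
heatwave <- as.integer(inverse.rle(runs))
heatwave  # 0 1 1 1 0 0 0 0: the two-day hot run is not flagged
```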
Quantify the effect of heatwaves on respiratory deaths. Interpret the result. Hint: include heatwave as a linear term, time as a spline, month as a spline, and adjust for day of week and PM10.
Reply:
mgcv::gam(resp ~ s(time) + s(month) + dow + pm10 + factor(heatwave),
data = chicagoNMMAPS , family = "poisson" ) -> mod_heatwave1
mod_heatwave1 %>% summary()
##
## Family: poisson
## Link function: log
##
## Formula:
## resp ~ s(time) + s(month) + dow + pm10 + factor(heatwave)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.1864645 0.0146495 149.252 < 2e-16 ***
## dowMonday 0.0151681 0.0179658 0.844 0.398515
## dowTuesday 0.0217089 0.0178378 1.217 0.223596
## dowWednesday 0.0043123 0.0179379 0.240 0.810018
## dowThursday 0.0053557 0.0179001 0.299 0.764790
## dowFriday -0.0172347 0.0180559 -0.955 0.339823
## dowSaturday 0.0167863 0.0177920 0.943 0.345437
## pm10 0.0002544 0.0002646 0.962 0.336194
## factor(heatwave)1 0.1124055 0.0306566 3.667 0.000246 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(time) 7.699 8.571 85.12 <2e-16 ***
## s(month) 7.723 8.588 936.28 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.156 Deviance explained = 15.7%
## UBRE = 0.18595 Scale est. = 1 n = 4863
(exp(0.11 - 1.96*0.03) - 1)*100; (exp(0.11 + 1.96*0.03) - 1)*100
## [1] 5.253338
## [1] 18.38833
There is a 5-18% increase in respiratory mortality risk during heatwave days, after accounting for day of week, seasonality, long-term time trends and PM10.
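The back-transformation used above can be wrapped in a small helper (a convenience sketch, not part of the tutorial code) applied to the unrounded coefficient and standard error from the summary:

```r
# convert a log-scale coefficient and its SE to a % change with a 95% Wald CI
pct_change <- function(beta, se) {
  est <- (exp(beta) - 1) * 100
  ci  <- (exp(beta + c(-1.96, 1.96) * se) - 1) * 100
  c(est = est, LL = ci[1], UL = ci[2])
}
# unrounded values for factor(heatwave)1 from the summary above
round(pct_change(0.1124055, 0.0306566), 1)  # est 11.9, LL 5.4, UL 18.8
```

Using the unrounded values gives slightly tighter limits than the hand computation with 0.11 and 0.03.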
Question 5.3 Examine a potential lag effect of the heatwaves on respiratory mortality. Consider lags 1:10 and plot the effect. What do you observe?
Reply:
k <- 1:10
res_store <- list()
# at iteration i the heatwave indicator is extended by one extra day,
# giving the exposure window for lag i
for(i in seq_along(k)){
if(i==1){
chicagoNMMAPS %>%
dplyr::mutate(
heatwave_lagged = dplyr::case_when(heatwave == 1 |
dplyr::lead(heatwave, 1) == 1 ~ 1,
TRUE ~ 0)
) -> chicagoNMMAPS
}else{
chicagoNMMAPS %>%
dplyr::mutate(
heatwave_lagged = dplyr::case_when(heatwave_lagged == 1 |
dplyr::lead(heatwave_lagged, 1) == 1 ~ 1,
TRUE ~ 0)
) -> chicagoNMMAPS
}
mgcv::gam(resp ~ s(time) + s(month) + dow + pm10 + factor(heatwave_lagged),
data = chicagoNMMAPS , family = "poisson" ) -> tmp
res_store[[i]] <- list(est = (exp(coef(tmp)["factor(heatwave_lagged)1"])-1)*100,
LL = (exp(coef(tmp)["factor(heatwave_lagged)1"] -
1.96*summary(tmp)$se["factor(heatwave_lagged)1"])-1)*100,
UL = (exp(coef(tmp)["factor(heatwave_lagged)1"] +
1.96*summary(tmp)$se["factor(heatwave_lagged)1"])-1)*100)
}
lapply(res_store, unlist) %>% do.call(rbind, .) %>% as_tibble() %>%
mutate(type =
factor(paste0("lag ", k),
levels = paste0("lag ",k))) -> plotres
colnames(plotres)[1:3] <- c("est", "LL", "UL")
ggplot(data = plotres) +
geom_point(aes(x=type, y=est)) +
geom_errorbar(aes(x=type, ymin=LL, ymax=UL), width = 0.1) +
geom_hline(yintercept = 0, col = "red", linetype = "dotted") + theme_bw() +
ylab("% increase in respiratory mortality") + xlab("Lags") -> p1
ggplot(data = plotres) +
geom_point(aes(x=type, y=est)) +
geom_line(aes(x=type, y=est, group=1)) +
geom_ribbon(aes(x=type, ymin=LL, ymax=UL, group = 1), fill = "blue", alpha = 0.2) +
geom_hline(yintercept = 0, col = "red", linetype = "dotted") + theme_bw() +
ylab("% increase in respiratory mortality") + xlab("Lags") -> p2
p1/p2
The effect of heatwaves across the lag dimension looks roughly linear. The evidence of an effect at lag 10 is weak.
Question 5.4 Fit a distributed non-linear model. Consider a linear threshold model with the threshold being the 95th percentile of the temperature and lags 1:10.
cb.heatwave <- crossbasis(chicagoNMMAPS$temp, lag=10,
argvar=list(fun="thr", thr = threshold),
arglag=list(fun="lin"))
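For intuition, the fun="thr" variable basis is essentially a hinge function: zero up to the threshold and linear above it (a sketch assuming dlnm's default high-threshold side for a single threshold value):

```r
# hinge transform: exposure contributes only through its excess over the threshold
thr_basis <- function(x, thr) pmax(x - thr, 0)
thr_basis(c(20, 28, 31), thr = 29)  # 0 0 2: only the value above 29 contributes
```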
Use the gam() function, accounting for time as a spline and month as a spline, and adjusting for day of week and PM10. Take the summary() of the model and interpret the slopes corresponding to the cross-basis function. What is the difference between this estimate and the heatwave estimate?
Reply:
mgcv::gam(resp ~ s(time) + s(month) + dow + pm10 + cb.heatwave,
data = chicagoNMMAPS, family = "poisson") -> dlnm_heatwave_model
dlnm_heatwave_model %>% summary()
##
## Family: poisson
## Link function: log
##
## Formula:
## resp ~ s(time) + s(month) + dow + pm10 + cb.heatwave
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.1846901 0.0147611 148.003 < 2e-16 ***
## dowMonday 0.0146494 0.0179719 0.815 0.41500
## dowTuesday 0.0208387 0.0178513 1.167 0.24307
## dowWednesday 0.0041436 0.0179525 0.231 0.81746
## dowThursday 0.0047102 0.0179246 0.263 0.79272
## dowFriday -0.0172538 0.0180680 -0.955 0.33961
## dowSaturday 0.0163497 0.0178051 0.918 0.35848
## pm10 0.0002979 0.0002630 1.133 0.25728
## cb.heatwavev1.l1 0.0154035 0.0037894 4.065 4.81e-05 ***
## cb.heatwavev1.l2 -0.0022694 0.0006927 -3.276 0.00105 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(time) 7.687 8.563 82.12 <2e-16 ***
## s(month) 7.750 8.606 857.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.156 Deviance explained = 15.7%
## UBRE = 0.18649 Scale est. = 1 n = 4856
(exp(0.015)-1)*100
## [1] 1.511306
The first cross-basis coefficient is the intercept of the linear lag curve: at lag 0 there is a 1.5% increase in respiratory mortality risk for every 1\(^o\)C increase in temperature above the 95th percentile of the temperature distribution.
The effect of temperatures above the 95th percentile declines across the lags, with the log effect decreasing by 0.002 for every additional lag (roughly a 0.2% decrease per lag on the percentage risk scale).
The heatwave estimate differs from this: the heatwave coefficient gives the risk during heatwave days (a period of extreme heat), whereas the current estimate gives the increase in risk for every 1\(^o\)C increase in temperature above the threshold.
Question 5.5 Use the crosspred() function. Specify at=-20:30 and bylag=1. Recall that at gives the temperature values and bylag the lag increments at which to predict. Retrieve the cumulative relative risk at 30\(^o\)C over the lags. What is the interpretation?
Reply:
pred.hw <- crosspred(cb.heatwave, dlnm_heatwave_model, at=-20:30, bylag=1)
pred.hw$allRRfit["30"]
## 30
## 1.204326
pred.hw$allRRlow["30"]
## 30
## 1.02266
pred.hw$allRRhigh["30"]
## 30
## 1.418263
Relative to temperatures lower than the 95th percentile of the temperature distribution, there is a 20% (2%, 42%) increased respiratory mortality risk across lags 0-10 when the temperature is 30\(^o\)C.
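As a rough cross-check (hand arithmetic with the rounded coefficients from the summary above, not an exact reproduction of crosspred): with a linear lag basis the lag-specific log effect is b1 + b2 × lag, so the cumulative log-RR per degree above the threshold is the sum of this line over lags 0 to 10:

```r
b1 <- 0.0154035   # cb.heatwavev1.l1: lag-curve intercept
b2 <- -0.0022694  # cb.heatwavev1.l2: lag-curve slope
cum_per_degree <- sum(b1 + b2 * 0:10)  # = 11*b1 + 55*b2, about 0.045
# exp((30 - threshold) * cum_per_degree) should then roughly match
# pred.hw$allRRfit["30"]; the reported RR of about 1.20 is consistent with a
# 95th-percentile threshold of roughly 26 degrees C
exp(4 * cum_per_degree)  # about 1.20 for a day 4 degrees above the threshold
```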
Question 5.6 Plot a 3D plot to show the lag, temperature and relative risk dimensions. Also, plot 2 slices of the 3D plot, one for lag=1 and the second for temperature=30. If interested, you can use the plotly package to produce an interactive 3D plot.
plot(pred.hw, theta=150, phi=10, lphi=100, xlab = "Temperature (\u00B0C)", zlab="RR")
plot(pred.hw, "slice", lag=1, las = 1,
main = "Relative respiratory mortality risk: lag 1",
xlab = "Temperature (\u00B0C)", ylab = "", cex.main=1.0)
plot(pred.hw, "slice", var=30, las = 1,
main = "Relative respiratory mortality risk: Temperature 30\u00B0C",
xlab = "Lag", ylab = "", cex.main=1.0)
and with the plotly package:
p <- plot_ly()
p <- add_surface(p,
x = 0:10, # the lags
y = -20:30, # the temperature as is on the pred.hw$matRRfit defined by at
z = pred.hw$matRRfit)
layout(p, scene = list(xaxis = list(title = "Lag"),
yaxis = list(title = "Temperature (\u00B0C)", range = c(0,30)),
zaxis = list(title = "Relative risk"))
)